Move 3: Present the Present Work |
In Section 2, we start with transcription and coding, where conflicting judgments between experts or evaluators quite often show up. The degree of conflict can be made clear by calculating agreement indices. Moreover, we will show how data on which disagreement occurs ought to be dealt with in the analysis. The statistical analysis of frequency data is the central topic of Section 3. Basically, the analysis of this type of data is fairly straightforward. The primary technique is χ² analysis, a technique explained in introductory textbooks on statistics. An important assumption of χ² analysis and equivalent statistics is the independence of observations, and precisely this assumption is problematic in corpus research. We show how two kinds of dependence may interfere with the statistical analysis, both resulting in a Type I error rate which is too high; that is to say, the significance of an effect is claimed too often where in fact there is no effect. Section 4 deals with two other well-known problems in χ² analysis, viz. the effects of small and large samples. Small samples tend to yield few significant effects, while the ‘high significance’ levels obtained with large samples are often incorrectly interpreted as indicators of substantial effects. For small samples the concept of power is relevant. For large samples, we need an index which expresses the size of an effect independently of the sample size. In Section 5, we discuss the use of the log odds ratio as an alternative to χ² analysis. Its use is still quite rare in corpus analysis, although it has outstanding statistical properties. Log odds form the basis of attractive multivariate techniques, such as logit analysis and logistic regression.
|